Abstract

In this study we introduce and test several methods to reduce the computational cost in
dynamic programming algorithms for isolated word recognition systems. Three methods are
discussed in detail: 1.) Pruning by preset thresholds 2.) Search based on the Branch and Bound
technique 3.) Branch and Bound based search with additional pruning. Compared to
conventional algorithms, Method 3.) could be seen to yield a speed up of approximately a factor
of 5 at no loss of recognition accuracy. The branch and bound method with pruning is also
ideally suited for research oriented systems, since pruning is independent of the parametrization
used (eliminates the necessity for retuning thresholds). Additional features of this method, which
are of importance to maintaining the flexibility and diagnosticity needed for such a system, will
be discussed.
1. Introduction

For the development of practical speech recognition systems, computation speed is one of the
predominant design factors. Several commercially available systems still employ linear time
normalization techniques, which are inferior in terms of recognition accuracy, to account for
speaking rate variations, since the dynamic programming (DP) technique is computationally very
costly. Even in a research environment, the turn-around time for larger experimental runs over
large speech data-bases can easily be in the order of days or weeks. Consequently, several methods
have been employed to reduce the redundancies in isolated word recognition systems. Referring to
the commonly used DP-matching techniques, as used by Sakoe and Chiba, Itakura, Rabiner and
others [1,2,3], it can be seen that the bottleneck of nonlinear time normalization is given by the
number of points within a matrix (defined by the frames of an unknown utterance x and a
known reference utterance y) that are needed to find an optimum matching path. The
computation needed for each of these points includes the computation of a distance between the
particular test-frame and reference-frame under consideration and the derivation of a cumulative
score defined by the constraints of the DP-matching algorithm in use. In a computer program
that performs DP-matching, these operations will typically constitute the innermost loop and
therefore be the most repetitious and most expensive in time. Finding less expensive warping
constraints or distance functions, however, will in most cases yield a loss in recognition accuracy.
Two other methods have been used by Sakoe & Chiba [1] and by Rabiner [3]. The first is the
definition of a window (see footnote 1) around the diagonal of the warping matrix that defines
the boundaries of any allowable warping path. This definition is not only useful but also, for some
warping functions, needed to prohibit possible nonlinguistic paths through the matrix. Reduction
of the width of this window thus increases computational speed significantly. It has been shown [4]
that a window that restricts the warp search path to lead or lag behind a linearly time-normalized
match by not more than 50 msecs is the optimal choice for an isolated word recognition system
using an alpha-digit vocabulary. Such a window constraint was seen to not only provide a
computational saving of up to 70% but also in some cases to increase recognition accuracy.
Footnote 1: It should be noted that, in that paper, the window was not mainly introduced for efficiency reasons.
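
The effect of an adjustment window on the number of grid points can be illustrated with a
minimal Python sketch. It is not the system described in this report: it assumes one-dimensional
features with an absolute-difference distance, simple symmetric local steps instead of the
Itakura constraints, and a hypothetical tolerance t given in frames rather than msec. The
function name dtw_with_window is likewise only illustrative.

```python
import math


def dtw_with_window(test, ref, t=None):
    """Return (score, grid_points) for a DP match of two 1-D sequences.

    t is the half-width (in frames) of the window around the linearly
    time-normalized diagonal; t=None disables the window.
    Assumption: symmetric local steps, not the Itakura constraints.
    """
    n, m = len(test), len(ref)
    prev = [math.inf] * m
    grid_points = 0
    for i in range(n):
        cur = [math.inf] * m
        # Reference position that a linear time normalization assigns to frame i.
        diag = i * (m - 1) / (n - 1) if n > 1 else 0.0
        for j in range(m):
            if t is not None and abs(j - diag) > t:
                continue  # outside the window: this grid point is never computed
            grid_points += 1
            d = abs(test[i] - ref[j])
            if i == 0 and j == 0:
                best = 0.0
            else:
                best = min(
                    prev[j],
                    prev[j - 1] if j > 0 else math.inf,
                    cur[j - 1] if j > 0 else math.inf,
                )
            cur[j] = d + best
        prev = cur
    return prev[m - 1], grid_points


test = [1, 2, 3, 4, 5, 6]
ref = [1, 2, 2, 3, 4, 5, 6]
print(dtw_with_window(test, ref))        # exhaustive search of the 6 x 7 plane
print(dtw_with_window(test, ref, t=2))   # same match restricted to the window
```

For these two short sequences the unconstrained match computes all 6 x 7 = 42 grid points,
while the windowed match computes 22 and still finds the same score.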
Further methods have been suggested to increase computational efficiency. In the following
chapter we will briefly describe a method suggested by Rabiner et al. [3] and then introduce two
alternate methods. In the third chapter we will report the results of extensive testing on all
methods reported here.




2. Efficient Algorithms for Non-linear Time Warping

In this chapter we will describe three methods currently in use in our isolated word recognition
system to perform dynamic programming in an efficient manner. Most methods are based on
the idea that, analogous to the presumed strategy of human perception, selection of the correct
candidate out of a reference vocabulary can be performed in an anticipatory way, by process of
elimination. In other words, particularly inappropriate candidates can be discarded comparatively
early in the matching process, i.e., the match can be aborted.

2.1  Preset  Thresholds 

In this way, Rabiner et al. have obtained significant reductions in computation cost. Two
thresholds are predefined, denoted Tmin and Tslope. The computation of the warp is performed
by computing the distances and the Itakura warping function [2] between a given frame i in the
test token and a column (specified by the search space) of reference frames (see Fig.2-1). For each
of these grid points a cumulative dissimilarity score of the best path leading to this point is
obtained in this fashion. The minimum score out of these cumulative scores ("localmin") is
determined and compared to the threshold Ti (see footnote 2).

If localmin > Ti the warp is aborted and recognition proceeds to the next candidate; Ti is given
by

Ti = (Tmin + i * Tslope) * N

where N is the number of frames in the test utterance. Referring to Fig.2-2, it can be seen that
N times Tslope can be viewed as the average distance that can be added to the cumulative score
per test frame along the search path without causing the pruning mechanism to abort the match.
The factor N provides a further adjustment depending on utterance length. Both Tmin and Tslope
have to be set in such a fashion that they minimize computation (for efficiency) but are generous
enough to not degrade recognition performance (e.g., by aborting "a good match").
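
The abort test can be sketched in a few lines of Python. This is a simplified illustration under
stated assumptions, not the implementation used in the system: features are one-dimensional,
the distance is an absolute difference, the local steps are symmetric rather than Itakura's, the
frame index i is taken to start at 1, and the names match_with_preset_thresholds, t_min and
t_slope are hypothetical.

```python
import math


def match_with_preset_thresholds(test, ref, t_min, t_slope):
    """Return the final cumulative score, or None if the warp was aborted.

    After each test frame the lowest cumulative score in the column
    ("localmin") is compared with Ti = (Tmin + i * Tslope) * N.
    Assumption: symmetric local steps, frame index i counted from 1.
    """
    n, m = len(test), len(ref)
    prev = [math.inf] * m
    for i in range(n):
        cur = [math.inf] * m
        for j in range(m):
            d = abs(test[i] - ref[j])
            if i == 0 and j == 0:
                best = 0.0
            else:
                best = min(
                    prev[j],
                    prev[j - 1] if j > 0 else math.inf,
                    cur[j - 1] if j > 0 else math.inf,
                )
            cur[j] = d + best
        localmin = min(cur)
        threshold = (t_min + (i + 1) * t_slope) * n
        if localmin > threshold:
            return None  # abort: proceed to the next reference candidate
        prev = cur
    return prev[m - 1]


test = [1, 2, 3, 4]
print(match_with_preset_thresholds(test, [1, 2, 3, 4], t_min=0.5, t_slope=0.25))
print(match_with_preset_thresholds(test, [9, 9, 9, 9], t_min=0.5, t_slope=0.25))
```

With these example values the identical reference runs to completion, while the strongly
differing reference is aborted after the first test frame. Both thresholds depend on the scale
of the distance values, which is why they must be retuned when the representation changes.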

Footnote 2: Note that the techniques described here would have to be altered if different warping
algorithms were used. The Itakura algorithm appears particularly practical for these methods.
Restriction of the Search Space via an Adjustment Window
The dotted area indicates computational saving through the use
of the window constraint. Tolerance t is used as a measure
of the width as well as the saving achieved.

Figure 2-1: Warping Plane Indicating the Search Space of the Itakura Algorithm

[Figure label: N * Tmin]

Figure 2-2: Pruning Using the Preset Thresholds Tmin and Tslope
2.2  Branch  and  Bound 

In a research oriented speech recognition system it is sometimes desirable for experimentation
to ensure that recognition results are not affected by pruning mechanisms, i.e., that they are
guaranteed to reflect only the differences in the overall dissimilarity scores derived from all
matches. Nevertheless, one would want to avoid unnecessary computation. This is provided by a
method that is based on the "branch and bound" search technique. This technique requires that
the various matches of a recognition be performed in parallel.

Figure 2-3: Parallel Warping Planes (axes: test frames, reference frames)

What we mean by this "parallel warping" technique is illustrated in Fig.2-3. Instead of
performing all matches sequentially, each frame i in the test-token is matched with the
corresponding frames of the K reference tokens of a K-word vocabulary. Fig.2-3 illustrates this
technique by adding a dimension (k) to the warping process (usually depicted as a warping
plane). In this fashion K warping planes are considered at a time. Information about the
goodness of the matches with all the tokens in the reference vocabulary is available at all
intermediate stages i during the warp. Several methods to prune comparatively bad matches
suggest themselves. For the "branch and bound"-based technique, however, we do not prune
away a bad match. Rather, only the so far least expensive match (the one with the so far lowest
"localmin"-value) is expanded. This means that, instead of warping a particular test-frame i
against the various frames of the K reference tokens, the so far best match out of the K matches
is warped (thus proceeding in i) regardless of the momentary position in i of its search path.
This method is illustrated by Fig.2-4, which depicts the projection of the search paths onto the
ik-plane for the parallel warp and the "branch and bound"-based parallel warp. Clearly, in the
branch and bound method bad matches (i.e., matches between strongly differing speech signals)
will accumulate high distances and therefore be left behind. As soon as the best match reaches
the end of the test utterance, the recognition process is completed. Thus, implicit pruning is
performed on all other matches. This method has the advantage of guaranteeing that the lowest
dissimilarity score will be found, and thus it provides recognition results identical to those
obtained if no pruning were performed. As an additional advantage for research oriented systems,
it should be noted that users can specify a value n to obtain the n best matches in the recognition,
while only the least amount of computation necessary to obtain the n best matches is performed.
However, if n >> 1, of course, the computational saving will be minimal.
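
The expansion strategy described above can be sketched with a priority queue. The following
Python code is an illustrative sketch only, under several assumptions that are not taken from
this report: one-dimensional features, an absolute-difference distance, symmetric local steps
instead of the Itakura constraints, and hypothetical names (branch_and_bound, _advance). One
detail is also an assumption of the sketch: when a match reaches the last test frame, it is
re-queued with its true final score, so that it is reported only once no other match has a
lower "localmin". Because distances are non-negative, "localmin" can never exceed the final
score of a match, which is what makes the result identical to an exhaustive search.

```python
import heapq
import math


def _advance(test_frame, ref, prev):
    """Compute one column of cumulative scores from the previous column."""
    cur = [math.inf] * len(ref)
    for j, r in enumerate(ref):
        d = abs(test_frame - r)
        if prev is None:  # first test frame: paths start at reference frame 0
            best = 0.0 if j == 0 else cur[j - 1]
        else:
            best = min(
                prev[j],
                prev[j - 1] if j > 0 else math.inf,
                cur[j - 1] if j > 0 else math.inf,
            )
        cur[j] = d + best
    return cur


def branch_and_bound(test, refs, n_best=1):
    """Return ([(score, k), ...] for the n_best references, grid_points)."""
    n = len(test)
    # One heap entry per match: (bound, k, frames_done, last_column).
    # k is unique per entry, so ties in the bound are broken by k.
    heap = [(0.0, k, 0, None) for k in range(len(refs))]
    heapq.heapify(heap)
    results, grid_points = [], 0
    while heap and len(results) < n_best:
        bound, k, i, col = heapq.heappop(heap)
        if i == n:  # finished match whose final score is the lowest bound
            results.append((bound, k))
            continue
        col = _advance(test[i], refs[k], col)  # expand only the best match
        grid_points += len(refs[k])
        if i + 1 == n:
            heapq.heappush(heap, (col[-1], k, n, None))  # true final score
        else:
            heapq.heappush(heap, (min(col), k, i + 1, col))  # localmin
    return results, grid_points


test = [1, 2, 3, 4, 5]
refs = [[9, 9, 9, 9, 9], [1, 2, 3, 4, 5], [5, 4, 3, 2, 1]]
print(branch_and_bound(test, refs))            # best match only
print(branch_and_bound(test, refs, n_best=2))  # two best matches
```

In this toy vocabulary the poorly matching references are expanded only as far as needed to
show that their bound exceeds the winner's final score; asking for more candidates (larger
n_best) forces more of them to be expanded, which mirrors the shrinking saving for n >> 1.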

2.3  Branch  and  Bound  with  Pruning 

In many cases, such as practical recognition systems as well as during large production runs of
research oriented recognition systems, it often does not matter whether the exact individual
recognition outcomes are preserved, as long as the overall number of errors is not increased when
pruning is performed. If this is the case, the branch and bound method described above can be
extended to further reduce computation time. Every time a path is expanded by means of
continuing its warp, the number of frames that its search path is then leading before or lagging
behind any other path is determined. If this number exceeds the threshold Leadt, this other path
is pruned off. Leadt is given by

Leadt = (P/100) * N + 1

where P is a user-defined percentage and N the number of frames in the test utterance.

Figure 2-4: Expanding Search Paths in Parallel Warping Algorithms

Thus, using the illustration in Fig.2-4, if we were expanding path 1 to i1 and if i1 - i2 > Leadt,
match 2 would be aborted. In addition to drastically decreasing the computational effort, this
pruning method is entirely independent of the numerical values of the distances, scores, and
spectral coefficients. It is therefore ideally suited for systems in a developmental stage. Using
other pruning methods, frequent changes in the representation of the speech signal would cause
the necessity for repeated retuning of thresholds to optimally trade off recognition accuracy and
computational saving.
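
The lead test itself involves only frame positions, which the following small Python sketch makes
explicit. The function names lead_threshold and prune_lagging and the dictionary of path
positions are hypothetical; the sketch takes the "leading before or lagging behind" wording
literally by using the absolute difference in frames.

```python
def lead_threshold(p, n):
    """Leadt = (P/100) * N + 1, expressed in frames."""
    return p * n / 100.0 + 1


def prune_lagging(positions, expanded, p, n):
    """Return the ids of the matches to abort after `expanded` was advanced.

    positions maps a match id to the test-frame index its path has reached.
    Only frame counts are compared; no distance or score values are used.
    """
    limit = lead_threshold(p, n)
    lead = positions[expanded]
    return [
        k
        for k, pos in positions.items()
        if k != expanded and abs(lead - pos) > limit
    ]


# Test utterance of N = 40 frames; path 1 has just been expanded to frame 20.
positions = {1: 20, 2: 12, 3: 14}
print(lead_threshold(15, 40))                    # threshold in frames for P = 15
print(prune_lagging(positions, 1, p=15, n=40))   # matches aborted for P = 15
print(prune_lagging(positions, 1, p=100, n=40))  # P = 100: nothing is pruned
```

With P = 15 and N = 40 the threshold is 7 frames, so match 2 (8 frames behind) is aborted and
match 3 (6 frames behind) survives. With P = 100 the threshold is N + 1 frames, which no path
can exceed, so the method reduces to the plain branch and bound search.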
3. Testing

As a measure of the computation needed using the algorithms described above, we use the
total number of grid points (of the warp search space) that were computed for each speaker and
the run time. As testing conditions, the algorithms were run on 5 data sets, 36 utterances each
(the alpha-digit vocabulary) for 8 speakers (4 male, 4 female). As reference data-set for each
speaker, a 36-utterance reference set was generated from 5 additional readings of the vocabulary.
A detailed description of the recognition system can be found elsewhere [4]. It should be
pointed out, however, that entirely automatic endpoint detection was used; no manual tuning
was performed. Some of the recognition errors reported in these results are due to errors in the
endpoint detection.

The results of these experimental runs are shown in figures 3-1 through 3-6.

The computational cost of the various algorithms tested is presented in figures 3-1 and 3-2.
The criterion for these graphs was to minimize cost under the constraint of maintaining or
reducing the error rate as compared to a conventional algorithm. The results are presented in
Fig.3-1 in terms of the number of grid points needed to compute the 180 recognitions of the test
data base and in Fig.3-2 in terms of the average run time per recognition in msec. The first
measure was chosen to provide a machine independent estimate of the savings obtained. As can be
seen from Fig.3-2 in comparison to Fig.3-1, a reduction in the number of grid points does not
directly translate into run time improvements. This is so since, in those cases, the number of
grid points ceases to be the predominant factor contributing to computational cost and the
overhead outside the innermost warping loop has to be considered. In both graphs, algorithm 1
(labeled "no pruning, no window") performs an exhaustive search of the Itakura warp [2];
algorithm 2 (no pruning, t=5 window) is algorithm 1 with the additional adjustment window
constraint that was previously reported [4] to yield better performance in accuracy and efficiency.
Finally, the results for algorithms 3, 4, and 5 are shown, i.e., for the branch and bound method
with no pruning (i.e., P = 100), the method of preset thresholds, and the branch and bound method
with pruning (P = 15), as described earlier. Using the fastest algorithm, our particular
implementation of the recognition system (running on a VAX-780) operates in less than 2.5 times
real time.
Figure 3-4: Number of Grid Points [180 Recognitions] for 8 Speakers vs. Pruning Factor P

Comparing Fig.3-1 with Fig.3-2, we also see that the run time improvements given by the
branch and bound method with pruning are not as substantial as indicated by the saving of grid
points in Fig.3-1. This behavior is due to the larger overhead needed to perform the branch and
bound search (comparing local minima among the references). This indicates that for
vocabularies substantially larger than 40 words an alternate search strategy might result in faster
operation.

In figure 3-3 the recognition results for the branch and bound based pruned algorithm are
shown for various values of the pruning factor P. Results are plotted separately for the eight
speakers in our data base and are given in terms of error rate (percent confused). Notice that
P = 100 means that no pruning is performed and the algorithm operates based on the branch and
bound technique only. From figure 3-3 it can be seen that pruning does not necessarily have to
be associated with an increase in error rate. In fact, for P = 15 or 20, improvements of up to 3% or
4% can be observed for some speakers. This is so because in many cases two initially badly
matching utterances, such as a "B" and a "C", will be prevented from leading to confusion by
the pruning operation. Notice also, independently of the pruning and in agreement with earlier
results [4] and other studies, the relatively high speaker dependency of the recognition results.

Figure 3-4 depicts the corresponding computational cost in terms of the number of grid points
in the search space, i.e., the number of times the innermost loop of the algorithm has to be
executed for the 180 recognitions of the testing corpus. As could be expected, the number of grid
points that need to be computed decreases monotonically with decreasing pruning factor. A
pruning factor of 15 or 20 (which yields acceptable recognition performance) will achieve a
reduction of grid points by a factor of 2 to 3. Notice also that, while the curves for the different
speakers behave qualitatively similarly as a function of the pruning factor, their actual
quantitative values differ strongly across speakers. This speaker dependency in speed (up to a
factor of two) has to be considered should certain run times be required in a practical recognition
system.

To summarize these observations in a very crude way we have taken the liberty, for the
purpose of illustration, of averaging our results over the eight speakers, as shown in figures 3-5
and 3-6. A value of P = 20 can be seen to yield the lowest error rates while a value of P = 15 still
leads to equivalent performance. This suggests that enough discriminatory confidence is
accumulated when the path of an inappropriate reference candidate falls behind the best path by
more than 20 percent of the length of the test token. This result shows that a search algorithm
with pruning, i.e., an algorithm that does NOT perform an exhaustive search for the optimal score,
often improves performance by virtue of imposing an additional constraint on the search. This
observation is consistent with previous results concerning the optimal choice of an adjustment
window [4].

The cost (in terms of grid points) averaged across speakers obtained by such pruning can be
inferred from figure 3-6. Note that if one attempts to meet certain performance goals, it is better
to use the data obtained for the speaker with the highest run times, rather than the average across
speakers. For the purpose of comparison, however, we have chosen to present the data in this
way.

To obtain optimal performance data for the preset thresholds algorithm, the two thresholds were
determined empirically so as to provide an error rate equivalent to that of an algorithm where no
pruning was performed, while minimizing computational cost.
4. Summary and Conclusion

We have shown that the branch and bound method with pruning is the fastest algorithm of
all the methods we have investigated. More importantly, it is insensitive to changes in parametric
representation or in vocabulary. This insensitivity to changes proves to be very beneficial in
research systems when almost all aspects of the system are changing continuously, since
optimization and tuning of thresholds is usually time consuming and cumbersome.


More  specifically,  the  advantages  are: 

•  This method yields 5 times faster operation for our recognition system than
performing a conventional exhaustive search. (Using a lead threshold of 15% of the
length of the test token (P = 15) as pruning factor in addition to a branch and bound
based search method (which yields approximately the same error rate as without
pruning) and using a search space window of ±50 msec [4].)

•  Substantial  cost  reductions  were  achieved  due  to  the  insensitivity  of  the  algorithm  in 
face  of  system  changes  such  as  changes  in  parametric  representation  or  vocabulary, 
thus  eliminating  the  need  for  costly  retuning. 

•  Flexible pruning thresholds (from no pruning at all up to rigid pruning) allow one to
manually trade off efficiency and recognition performance, if so desired.

•  If  no  pruning  is  performed,  the  algorithm  reduces  to  the  branch  and  bound  search 
guaranteeing  optimality.  This  provides  identical  results  as  exhaustive  search,  while 
reducing  the  computational  cost  by  about  60%. 

•  It  is  also  possible  to  compute  the  guaranteed  n-best  candidates  while  obtaining  more 
efficient  operation. 


The disadvantage of this technique is its higher requirement for primary memory storage,
since several matches are operated on "in parallel". For systems with insufficient local memory,
fast software implementations of such a technique and VLSI implementations might therefore be
limited more severely by the problem of performing fast I/O than by the computation
necessary for recognition.